De Novo Structural Variations of Escherichia coli Detected by Nanopore Long-Read Sequencing

Abstract
Spontaneous mutations power evolution, whereas large-scale structural variations (SVs) remain poorly studied, primarily because of the lack of long-read sequencing techniques and powerful analytical tools. Here, we explore the SVs of Escherichia coli by running 67 wild-type (WT) and 37 mismatch repair (MMR)–deficient (ΔmutS) mutation accumulation lines, each experiencing more than 4,000 cell divisions, applying Nanopore long-read sequencing and Illumina PE150 sequencing, and verifying the results by Sanger sequencing. In addition to precisely reproducing previously reported rates of base-pair substitutions and small insertions and deletions (indels), we find significant improvement in insertion and deletion detection using long-read sequencing. Long-read sequencing and the corresponding software detect bacterial SVs with high accuracy in both simulated and real data sets. These lead to SV rates of 2.77 × 10−4 (WT) and 5.26 × 10−4 (MMR-deficient) per cell division per genome, which are comparable with previous reports. This study provides the SV rates of E. coli by applying long-read sequencing and SV detection programs, revealing a broader and more accurate picture of spontaneous mutations in bacteria.


Introduction
Spontaneous mutations occur in all living organisms and are the primary source of genetic variation. Common types of mutations are base-pair substitutions (BPSs), small insertions and deletions (indels), and large-scale structural variations (SVs). Most previous studies have focused primarily on BPSs and small indels due to sequencing technology limitations (Lynch et al. 2016;Long et al. 2018b;Pan et al. 2022). Although SVs have often been neglected or left unresolved, early studies found that many human diseases are associated with them. For example, duplication fragments of human chromosome 17p lead to Charcot-Marie-Tooth disease type 1A, and large homozygous deletions of the 2p13 region result in juvenile nephronophthisis (Lupski et al. 1991;Konrad et al. 1996). SVs also play essential roles in genome evolution: some beneficial SVs may help organisms adapt to their environments, and some copy number variant-dominated SVs are positively selected with higher frequencies (Emerson et al. 2008;Iskow et al. 2012;Kondrashov 2012). Differences in large-effect SVs of genes controlling specific traits at the population level imply that SVs may be associated with the formation of new species (Chan et al. 2010). Because most bacterial genomes are haploid, the fitness effects of SVs in bacteria are even more significant than those in humans. SVs have a profound impact on the evolution of bacteria, particularly for many pathogenic species, where the pathogenicity or new virulence phenotypes are associated with SV-carrying critical genes that are frequently caused by transposition or recombination events (Lieberman et al. 2011;Damkiaer et al. 2013;Lee et al. 2016).
Previous studies have detected SVs mostly by using short paired-end reads (Ye et al. 2009;Iqbal et al. 2012;Rausch et al. 2012; Barrick et al. 2014;Deatherage and Barrick 2014;Fan et al. 2014;Layer et al. 2014;Chen et al. 2016;Lee et al. 2016;Tian et al. 2018). Such strategy has played a key role in the identification of SVs, revealing their diversity in individuals to population (Ma et al. 2021;Zhao et al. 2021a;Chen et al. 2022). Based on such analytical strategies, E. coli insertion sequence (IS) elements were reported to have an insertion rate of 3.5 × 10 −4 and a recombination rate of 4.5 × 10 −5 per genome per generation, and the transposition rate in E. coli measured by other methods was about 10 −5 (Sousa et al. 2013;Lee et al. 2016). However, the accuracy of such explorations may be affected by the inherent defects of short-read sequencing (Putze et al. 2009;Lee et al. 2014;Mahmoud et al. 2019). In contrast, the combination of long-read sequencing and more advanced bioinformatics tools can provide unique anchors in the repeat regions of the reference genome and achieve better results for identifying breakpoints and more types of SVs (Cretu Stancu et al. 2017;Mahmoud et al. 2019). Such strategy has been greatly optimized for identifying SVs in complex and nested sequences or low-depth sequencing data (Sedlazeck et al. 2018;Tham et al. 2020). Consequently, long-read sequencing provides a more complete and precise view of de novo spontaneous mutations at all scales, although such trials are rarely performed.
Mutation accumulation (MA) combined with whole-genome sequencing is the most classical strategy for determining the rate and spectrum of spontaneous mutations (Foster 2006;Lee et al. 2012). Single-individual transfers repeatedly bottleneck large sets of parallel lines, so that genetic drift dominates selection, and even deleterious mutations can be accumulated, eventually providing nearly unbiased mutational features. MA of DNA mismatch repair (MMR)-defective strains can further provide an accurate picture of mutations before MMR-specific repair (Iyer et al. 2006;Lee et al. 2012;Long et al. 2016;Long et al. 2018a). In this study, we tested and identified better strategies for analyzing bacterial SVs using in silico simulation, MA of wild-type (WT) and MMR-deficient E. coli K-12 MG1655, and Nanopore long-read and Illumina PE150 sequencing.

Results
To detect SVs in the E. coli K-12 MG1655 genome, we accumulated de novo mutations by daily single-colony streaking 80 WT MA lines and 40 MMR-defective (ΔmutS) lines from 1 WT ancestor cell and 1 ΔmutS ancestor cell, respectively. Eventually, 67 WT and 37 ΔmutS MA lines were used for the final analysis after removing low-coverage, crosscontaminated lines or those with mutations falling in other repair systems (supplementary table S1, Supplementary Material online). Each WT MA line experienced about 4,480 cell divisions and was sequenced to a mean depth of coverage 99× (standard error, SE: 5.56) and 4,320 cell divisions and 123× (SE: 9.34) for the ΔmutS MA lines. More than 99% of the genomes of all the MA lines were covered with high-quality reads (supplementary tables S2 and S3, Supplementary Material online). We also performed Nanopore long-read sequencing on 19 WT and 18 ΔmutS MA lines as well as their ancestors (1 ΔmutS line was removed due to 3 mutations in the repair gene mutT) with ∼1 Gbp to 3 Gbp for each line (supplementary  table S4, Supplementary Material online). The features of BPSs and small indels are highly consistent with previous studies, confirming the high repeatability of the E. coli mutation-accumulation experiments (supplementary file S1, figs. S1 and S2, and tables S2, S3, and S5-S13, Supplementary Material online).

Evaluating the SV Detection Pipelines with Simulated Data
We first evaluated the reliability of the widely used SV detection pipelines by running them on simulated short-read and long-read data sets with preset mock mutations (see details in Materials and Methods). For the simulated short-read data set, breseq (v-0.35.1) performs the best for analyzing deletions, with sensitivity and precision both close to 100% (table 1, fig. 1, and supplementary tables S14 and S15, Supplementary Material online). Considering that breseq is mainly used to identify deletions and insertions mediated by mobile elements, we also used Manta (v-1.6.0) to detect other SVs besides deletions, such as insertions, tandem duplications, and inversions. The analysis achieved satisfying results for the precision of tandem duplications and sensitivity of inversions (tables 1 and 2, fig. 1, and supplementary tables S14 and S15, Supplementary Material online). Similarly, for the simulated long-read data set, Sniffles (v-1.0.12) was chosen because it outperformed other programs in SV detection, as shown in the testing results of different SV callers (supplementary tables S15 and S16, Supplementary Material online) (Sedlazeck et al. 2018;Liu et al. 2020;Okazaki et al. 2022), especially for deletions and tandem duplications (fig. 1, tables 1 and 2, and supplementary tables S14 and S15, Supplementary Material online). SV analyses on simulated data show that breseq detects deletions with high sensitivity and precision, Manta performs ideally on other SV types with short reads as input, and Sniffles is appropriate for detecting SVs using long-read sequencing (table 1, fig. 1, and supplementary tables S14 and S15, Supplementary Material online). The SV results from long-read sequencing are more reliable than those from short-read sequencing, as shown by the universally high F1 scores of most types of SVs (table 1 and supplementary table S17, Supplementary Material online), which is consistent with previous studies (Merker et al. 2018;Lesack et al. 2022).
We also find that the number of SVs of certain types in the genome can affect the performance of the software to some extent. For example, using the short-read pipeline, sensitivity tends to increase with more tandem duplications, whereas for the long-read pipeline, the increase of inversion will greatly reduce the sensitivity and precision ( fig.  1). Besides, we also note that even short-read sequencing can give highly reliable results for deletions and short inversions in simulated genomes. We then finalize the pipelines and use them on the Illumina and Nanopore sequences of the MA lines we ran.
In addition, to ensure the transferability of the analysis pipeline for the simulated data, we similarly set up and analyzed the 0-variant mock genome. Based on the same short-read and long-read analysis pipelines, we did not detect any SVs, which confirmed the reliability of our pipelines.
(table 3). Compared with the total number of SVs from the short-read pipelines, those detected by the Nanopore sequencing pipelines are fewer because only some of the MA lines were randomly chosen, for cost reasons. Consistent with the results from simulated data, the high validation rate and number of SVs from the Nanopore data demonstrate the superiority of long-read sequencing in SV detection. This is in strong contrast to the ultra-high false-positive rate of inversions and tandem duplications from short-read sequencing (fig. 2A). Nonetheless, the precision of Sniffles in detecting insertions remains low (8.24% for WT and 6.00% for ΔmutS), even with the long-read strategy (supplementary table S17, Supplementary Material online). In addition, we find that the medium- and long-length SVs, especially the insertions and deletions, are preferentially detected, whereas the false-positive rate of the short SVs is relatively high (fig. 2B and 2C and supplementary tables S19 and S20, Supplementary Material online). Specifically, for SVs with different length ranges, the false-positive rates based on the short-read strategy are higher than those from the long-read strategy, especially for short and long SVs (fig. 2B and 2C). Finally, we combine SV results based on the 2 sequencing platforms and find that 83 out of 146 and 82 out of 133 SVs are validated in the WT and the ΔmutS MA lines, respectively (supplementary tables S19 and S20, Supplementary Material online). The number of SVs per WT or ΔmutS line, after combining SV results from the short-read and long-read strategies, is 1.24 or 2.22. The vast majority of these true-positive SVs are shorter than 1,500 bp in E. coli (fig. 2D and 2E).
Based on the verified SVs, we calculate the genomic SV rate of the WT E. coli to be 2.77 × 10 −4 per genome per cell division (95% CI: 2.95-4.34 × 10 −4 ) and 5.26 × 10 −4 per genome per cell division for the ΔmutS (95% CI: 7.37-10.34 × 10 −4 ), with significant difference between the SV rates of the 2 strains-a sign of MMR influencing the major types of SVs (supplementary tables S21 and S22, Supplementary Material online). The WT SV rate is lower but still comparable with those large chromosomal rearrangements of E. coli reported in previous studies implying a low false-positive rate of the sequencing and analytical pipelines (also confirmed by the above analyses on the simulated data sets) (Raeside et al. 2014). We calculate the BPS rates of the WT and the ΔmutS to be 9.00 × 10 −4 and 8.12 × 10 −2 per genome per cell division, respectively. The SV rates are thus ∼31% and 0.65% of the BPSs rates for the 2 strains, respectively, consistent with previous findings that large-scale mutations are usually less abundant than the small mutations (Pang et al. 2010 fig. 2F, and supplementary tables S19 and S20, Supplementary Material online). Such insertion bias of SVs is different from the deletion bias of small indels previously reported (Kuo and Ochman 2009;Lee et al. 2012;Long et al. 2016;Danneels et al. 2018;Long et al. 2018a;Loewenthal et al. 2021). One previous study on SVs of the same E. coli WT strain found that IS-mediated insertions were more common than deletions (Lee et al. 2016). However, the bias is reversed by the SVs length in the ΔmutS MA lines, as the total length of deletions is about 2.78 times higher than that of the insertions (supplementary tables S19 and S20, Supplementary Material online). Consistent with small indels, this deletion bias in DNA length could be related to the genomic contraction in bacteria, especially for those hosted in other organisms (Gregory 2004;Merhej et al. 2009;Bobay and Ochman 2017). 
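The arithmetic behind the per-line SV counts and the SV-to-BPS ratios quoted above can be sketched in Python. This is only an illustration under a simplifying assumption that is not stated in the text: every line in a group is taken to have experienced the same approximate number of cell divisions (about 4,480 for WT). The function names are hypothetical. With that shortcut the WT estimate comes out close to the reported 2.77 × 10−4, whereas the same shortcut for ΔmutS gives a somewhat lower figure than the reported 5.26 × 10−4, so the published values presumably rest on details not reproduced here.

```python
def svs_per_line(n_validated_svs: int, n_lines: int) -> float:
    """Mean number of validated SVs per MA line."""
    return n_validated_svs / n_lines


def rate_per_genome_per_division(n_events: int, n_lines: int,
                                 divisions_per_line: float) -> float:
    """Events divided by the total number of cell divisions across all lines.

    Assumption (not from the source): all lines share the same division count.
    """
    return n_events / (n_lines * divisions_per_line)


def ratio(sv_rate: float, bps_rate: float) -> float:
    """SV rate expressed as a fraction of the BPS rate."""
    return sv_rate / bps_rate


# Counts and rates quoted in the text.
wt_per_line = svs_per_line(83, 67)      # about 1.24
mut_per_line = svs_per_line(82, 37)     # about 2.22
wt_rate = rate_per_genome_per_division(83, 67, 4480)

print(f"WT SVs per line:    {wt_per_line:.2f}")
print(f"ΔmutS SVs per line: {mut_per_line:.2f}")
print(f"WT SV rate (approx.): {wt_rate:.2e}")
print(f"WT SV/BPS:    {ratio(2.77e-4, 9.00e-4):.1%}")
print(f"ΔmutS SV/BPS: {ratio(5.26e-4, 8.12e-2):.2%}")
```

Using the reported rates directly, the ratios come out at roughly 31% and 0.65%, in line with the figures in the text.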
Besides, we also analyzed the distribution of SVs along the chromosome. For the WT, the distribution of insertions in the genome is approximate to uniform distribution, and the deletions mainly cluster in 0-0. We also evaluated the features of IS element-mediated SVs-the most common SVs in bacterial genomes-in detail. IS elements are common mobile genetic elements in bacteria and play key roles in bacterial genome diversity and evolution (Ooka et al. 2009). Some SVs and complex recombination events mediated by IS elements have been found in E. coli MA lines (Lee et al. 2014;Raeside et al. 2014;Long et al. 2016). In our data sets, IS-mediated SVs dominate other SVs in both the WT and the ΔmutS MA lines, 70 (84.34%) and 43 (52.44%), respectively (table 4). The lengths of the IS-mediated SVs are extremely enriched around 500-1,000 bp (fig. 3A and 3B and supplementary tables S19  (table 4) are comparable with those reported in previous studies, for example, 3.5 × 10 −4 (95% CI: 3.2 × 10 −4 -3.7 × 10 −4 ) per genome per cell division in the same E. coli strains (Sawyer et al. 1987;Lee et al. 2016;Vandecraen et al. 2017;Consuegra et al. 2021). Among the IS-mediated SVs in the WT and the ΔmutS MA lines, transpositions by IS5, IS1, and IS2 have the top 3 rankings, with IS5 elements accounting for ∼40% ( fig. 3C and D and supplementary tables S19 and S20, Supplementary Material online). IS5 elements can insert the upstream or downstream of some operons to activate the expression of flagellar genes and glycoside metabolizing genes and thus indirectly alter the motility and glycoside utilization of E. coli (Schnetz and Rak 1992;Barker et al. 2004;Martinez-Vaz et al. 2005;Strauch and Beutin 2006;Wang and Wood 2011). Therefore, the high insertion rate of IS5 elements may be important in the migration and the niche evolution of bacteria. 
In addition, we find a significant correlation between the proportion of 1 type of IS elements (out of all IS elements mediating SVs) and their copy numbers in the reference genome ( fig. 4). In other words, the more IS elements of the same type in the genome, the more frequently they will mediate SVs.

Discussion
In this study, de novo spontaneous mutations of E. coli MG1655, especially the SVs, are extensively studied via different sequencing and analytical strategies. We analyze 104 final MA lines, including 67 WT and 37 ΔmutS lines. The mutation rates of BPSs and small indels are highly consistent with previous studies (supplementary tables S2, S3, and S13, Supplementary Material online) (Foster et al. 2015;Long et al. 2016). For the SV detection, we conclude that the strategy based on long-read sequencing and analysis is generally superior to that based on short reads in both simulated and real data (figs. 1 and 2A-C, tables 1 and 3, and supplementary tables S14, S15, and S18-S20, Supplementary Material online). The SV rates are 2.77 × 10 −4 per genome per cell division in the WT and 5.26 × 10 −4 in the ΔmutS, which are comparable with those previously reported (Lee et al. 2016). However, it is impossible to simulate all possible SV scenarios, and the complexity of real genomic regions can affect the precision for detecting SVs (Dierckxsens et al. 2021). Therefore, when applying the pipelines tested with simulated data to real data sets, the choices of software and parameters still need to be carefully refined.
Genome Biol. Evol. 15(6) https://doi.org/10.1093/gbe/evad106 Advance Access publication 9 June 2023
Based on the simulated and the real data analyses, long-read sequencing is indeed more powerful in detecting all types of bacterial SVs, with high precision and accuracy compared with short-read sequencing (figs. 1 and 2A-C, tables 1 and 3, and supplementary tables S14, S15, and S18-S20, Supplementary Material online). Although the number of de novo SVs generated during the MA experiments is much smaller than those reported in studies on existing SVs in natural lineages, the high precision and accuracy of the long-read sequencing in SV detection are highly consistent (He et al. 2019;Mahmoud et al. 2019;Mantere et al. 2019;Chawla et al. 2021;Sakamoto et al. 2021). Analyses based on short reads show high SV false-positive rates in bacteria, because most software was initially developed for the human genome and the algorithms ignore some SVs in simple repetitive regions in order to save computation resources (Rausch et al. 2012;Fan et al. 2014;Layer et al. 2014;Deatherage et al. 2015;Chen et al. 2016). Nonetheless, breseq and Manta are still useful in detecting deletions and other SVs, although Manta works at the cost of a high rate of false positives (figs. 1 and 2A-C and supplementary tables S14, S19, and S20, Supplementary Material online). As previously reported, the limitation of short-read sequencing in SV detection could originate from the nearby BPSs or indels around the SV breakpoints (Cameron et al. 2019). Even when multiple callers are integrated, false positives are still common, and its high sensitivity comes at the cost of disproportionately lower
We applied 2 alternative strategies to detect SVs in simulated and real MA data: short-read-based sequencing and calling with breseq and Manta, and long-read-based sequencing and calling with Sniffles. Apparently, different SV types are most amenable to different strategies regardless of the sequencing platforms.
Compared with short-read-based methods, the long-read-based strategy performs better for insertion SVs and will support future research related to SV characteristics and functions (table 1, fig. 2A, and supplementary tables S14 and S18-S20, Supplementary Material online). For identifying insertions, the advantages also apply to eukaryotes, suggesting that long reads outperform short-read strategies because they can span longer repetitive regions (Cretu Stancu et al. 2017;Huddleston et al. 2017;Wong et al. 2018;Liu et al. 2020;Zhao et al. 2021b). For deletions, although short-read sequencing performed well in the simulation results, long-read sequencing has higher accuracy with real data (table 1, fig. 2A, and supplementary tables S14 and S18-S20, Supplementary Material online). This may be due to the high complexity of the real situation and may reflect the advantage that long-read sequencing retains even at low coverage, as reported in previous research (Kosugi et al. 2019). Tandem duplications and inversions are rare in the real data sets (fig. 2A and supplementary tables S14 and S18-S20, Supplementary Material online), suggesting that there are relatively few tandem duplication and inversion SVs in the MA lines. These results corroborate that the 2 strategies can be combined for thorough SV detection and that even short-read sequencing could be accurate enough using breseq and Manta, if only deletion SVs are considered.
Previous studies on bacterial MA have primarily focused on characterizing BPSs and indels, and only limited inference about SVs based on short-read sequencing is available (Foster et al. 2015;Long et al. 2015;Kucukyildirim et al. 2016;Long et al. 2016;Strauss et al. 2017;Tincher et al. 2017;Long et al. 2018a;Pan et al. 2021;Wu et al. 2021). The SV detection strategy based on long reads has been generating numerous reliable results, for example, in the metagenomic study of lake bacterioplanktons and for detecting potential large-scale assembly errors of complex bacterial genomes with long repeat regions (Schmid et al. 2018;Okazaki et al. 2022). Our results also indicate that long-read sequencing, long-read tools, and intensive SV candidate validation with Sanger sequencing are needed to fully characterize full-scale mutations in evolved MA lines ( fig. 2A-C and supplementary tables S18-S20, Supplementary Material online).
However, although long-read detection tools have advantages over short-read ones when applied to both simulated and real bacterial data for identifying SVs, there are still some issues. Because long-read sequencing has a high error rate, it can affect the efficiency of long-read tools to detect SVs (Jiang et al. 2021). In addition, SV detection using long-read tools is also affected by the sequencing depth and SV types, for example, high sequencing depth could even reduce the accuracy of some tools (Luan et al. 2020;Dierckxsens et al. 2021;Lesack et al. 2022). Similarly, long-read tools also detect inversions unsatisfactorily, which also needs facilitation of other algorithms (Parrish et al. 2013).
The strategies outlined in this study should facilitate future research that involves SV analyses. For example, studies on gut microbiomes have shown that unique SVs can represent the genetic fingerprints of specific communities (Chen et al. 2021). The total length of SVs is almost 10× that of BPSs in our study, also demonstrating the important role of SVs in genome evolution (Korbel et al. 2007;Escaramís et al. 2015;Hämälä et al. 2021). In addition, SVs are reported to be closely associated with bacterial growth and adaptation to the environment, and their changes can also alter the immunity and metabolism of the host (Zeevi et al. 2019;Wang et al. 2021). It has also been shown that IS-mediated SVs in a population can not only promote evolution but also limit evolution after a meltdown (Consuegra et al. 2021). Advanced sequencing technologies combined with sophisticated programs would eventually push the precision and accuracy of SV detection to the point that would satisfy most biological studies. Further studies are needed in the future regarding the distribution of SV fitness effects in bacteria, and such studies would provide more insights into long-term genome evolution.

Strains and MA Procedures
All Escherichia coli strains were in the K-12 MG1655 background and generously provided by Patricia Foster's lab. Eighty WT and 40 ΔmutS MA lines were initiated and cultured on LB agar (Solarbio, Cat. No.: L8290) at 37 °C. Each line was single-colony transferred daily. We transferred each MA line 160 times on average, taking more than 5 months. In order to estimate the cell divisions between transfers (Num) by the colony-forming-units, we performed serial dilution every 10 days, by randomly choosing and razor-cutting a single colony from each of the 5 lines for the WT and the ΔmutS MA lines, respectively. Based on the formula log2(Num), there were, on average, 28 cell divisions for the WT lines and 27 for the ΔmutS lines between 2 adjacent transfers.
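The division estimate described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that each colony grows from a single founder cell so that a colony of Num cells corresponds to log2(Num) divisions; the function names and the example colony size are hypothetical and not taken from the study.

```python
import math


def divisions_per_transfer(cells_per_colony: float) -> float:
    """Cell divisions needed for one founder cell to reach the given colony size."""
    if cells_per_colony < 1:
        raise ValueError("a colony must contain at least one cell")
    return math.log2(cells_per_colony)


def total_divisions(divisions_between_transfers: float, n_transfers: int) -> float:
    """Cumulative cell divisions over all single-colony transfers of one line."""
    return divisions_between_transfers * n_transfers


# Hypothetical colony size of 2**28 cells, chosen only for illustration.
print(divisions_per_transfer(2 ** 28))   # 28.0

# Averages reported in the text: 28 (WT) and 27 (ΔmutS) divisions, 160 transfers.
print(total_divisions(28, 160))          # 4480
print(total_divisions(27, 160))          # 4320
```

The two totals agree with the roughly 4,480 and 4,320 cell divisions per line given in the Results.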

DNA Extraction, Library Construction, and Genome Sequencing
After the last transfer, we picked a single colony for each final MA line as well as the ancestral line for each strain and cultured them in the LB broth (Solarbio, Cat. No.: L8291) in quadruplicate overnight at 37 °C. One of the 4 cultures was used to extract DNA with MasterPure TM Complete DNA and RNA Purification Kit (Lucigen, Cat No.: MC85200) for Illumina sequencing. Each of the remaining 3 replicates was mixed with glycerin (10%) and stored at −80 °C. We constructed the short-read libraries of DNA that passed the concentration and quality requirements using an optimized protocol for TruePrep® DNA Library Prep Kit V2 for Illumina (Vazyme, Cat. No.: TD501-01) and TruePrep® Index Kit V3 for Illumina (Vazyme, Cat. No.: TD203). After agarose gel electrophoresis and cutting the target bands to recycle with the E.Z.N.A.® Gel Extraction Kit (Omega Bio-tek, Cat. No.: D2500-02), we obtained the libraries with insert sizes of about 300 bp. Then, PE150 sequencing was performed using 1 Illumina NovaSeq6000 sequencer at Berry Genomics, Beijing. For the WT and the ΔmutS final MA lines, we randomly chose 19 lines from each group, as well as their ancestors, to extract DNA and construct the libraries for the Nanopore long-read sequencing. The standardized mixed libraries were pooled and loaded into 1 flow cell (R9.4) and sequenced with 1 Oxford Nanopore PromethION sequencer (Benagen, Wuhan, China). Then, the electrical signals were converted into DNA bases by . Next, the adapters were removed from the data and the data was filtered with Q ≥ 7. After quality control, about 1-3 Gbp sequences for each sample were finally obtained (supplementary table S4, Supplementary Material online).

BPS and Indel Mutation Analysis
For the Illumina sequencing data, the 2 × 150 bp paired-end reads were first trimmed by Fastp (v-0.20) (Chen et al. 2018) to remove adapters and low-quality reads. After trimming, the reads were mapped to the reference genome (NC_000913.3), using the "mem" function in Burrows-Wheeler Aligner (v-0.7.17) (Li and Durbin 2009). The mapped reads were in SAM format and transformed into BAM format by SAMtools (v-1.9) ). Duplicate reads were removed by the function MarkDuplicates of picard-tools (v-2.20.1). Based on the local re-assembly feature, we used the HaplotypeCaller of Genome Analysis Toolkit (GATK, v-4.1.2.0) (McKenna et al. 2010;DePristo et al. 2011;Van der Auwera et al. 2013) with standard hard filters to call the BPSs and indels. Therefore, 13 lines were removed because of low coverage (less than 20×), cross-contamination of sequenced lines (randomly removing 1 line if 2 lines shared exactly the same BPS in the same site), or carrying mutations on repair genes (supplementary table S1, Supplementary Material online), and eventually 67 WT and 37 ΔmutS MA lines were used in the final analyses. All the indels were manually curated with the Integrative Genomics Viewer (IGV, v-2.8.2) (Thorvaldsdóttir et al. 2012).
Using the filtered BPSs and indels, we calculated the mutation rate μ with the formula:

$$\mu = \frac{m}{\sum_{1}^{n} N \times T}$$

Here, n was the number of MA lines. The number of mutations for all MA lines, the analyzed sites for each MA line, and the total cell divisions during the transfers were denoted by m, N, and T, respectively. The context-dependent mutation rates were analyzed as in our previous study (Long et al. 2015).
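As a rough sketch of the calculation, assuming the rate takes the form of the pooled mutation count m divided by the sum over the n lines of analyzed sites N times cell divisions T (an interpretation of the definitions above, not a statement of the authors' code), the computation could look as follows. The function name and the toy inputs are hypothetical.

```python
from typing import Sequence


def mutation_rate(m: int,
                  sites_per_line: Sequence[float],
                  divisions_per_line: Sequence[float]) -> float:
    """Per-site, per-division rate: m / sum_i (N_i * T_i).

    m                  -- total mutations pooled over all MA lines
    sites_per_line     -- analyzed sites N for each line
    divisions_per_line -- cell divisions T for each line
    """
    if len(sites_per_line) != len(divisions_per_line):
        raise ValueError("need one N and one T per MA line")
    denominator = sum(n * t for n, t in zip(sites_per_line, divisions_per_line))
    if denominator <= 0:
        raise ValueError("total site-divisions must be positive")
    return m / denominator


# Toy numbers for illustration only (not data from the study).
print(mutation_rate(10, [100, 100], [5, 5]))  # 0.01
```

Keeping N and T as per-line sequences allows lines with different coverage or transfer counts to contribute in proportion to their site-divisions.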

E. coli Genome Simulation
In order to evaluate the SV detection pipelines and based on the reference genome of E. coli MG1655 (NC_000913.3), we established 4 groups of simulated genomes, each carrying known SVs of only 1 type: insertions, deletions, tandem duplications, or inversions. Each group contained 3 simulated genomes with 100, 200, or 500 known SVs. This was done using RSVSim (v-1.34.0) (Bartenhagen and Dugas 2013), a Bioconductor package in R. In addition, we also randomly simulated BPSs and indels near the breakpoints of these SVs, mainly distributed in the range of 100 bp upstream or downstream of the breakpoints. The percentages of BPSs and indels out of the total number of SVs within a breakpoint's flanking regions are 0.1% and 0.05%, respectively, and the maximum length of indels is 20 bp. According to RSVSim's built-in algorithms, 1 flanking region can contain at most 1 indel, and the breakpoint coordinates of SVs in the genome follow a uniform distribution. The SV lengths in the simulated genomes of the 4 groups were set from 50 to 10,000 bp, with SV length distribution of 70% 50-1,000 bp, 20% 1,001-5,000 bp, and 10% 5,001-100,000 bp. Within the range, the length of each specific SV is randomly generated in R (v-4.
Besides, we also simulated a 0-variant genome and its short- and long-read sequencing data, and the methods as well as parameters were consistent with those described above.

Simulation of Illumina Short Reads and Nanopore Long Reads
ART (v-2.5.8) (Huang et al. 2011) simulated the short-read data sets using the above simulated genomes with known SVs. These data sets were composed of 2 × 150 bp Illumina short reads with a mean sequencing depth of about 100×, and the mean and standard deviation for the insert sizes were 300 and 50 bp.
The simulated short-read and long-read data sets were uploaded to the NCBI SRA database (BioProject Number: PRJNA856428).
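The binned SV length distribution described in the genome simulation (70% of SVs at 50-1,000 bp, 20% at 1,001-5,000 bp, and 10% at 5,001-100,000 bp) can be illustrated with a short Python sketch. This is not the RSVSim procedure used in the study: it assumes lengths are drawn uniformly within each bin, takes the bin boundaries exactly as written in the text, and uses hypothetical names throughout.

```python
import random

# Bins as written in the text: ((min_bp, max_bp), probability).
LENGTH_BINS = [
    ((50, 1_000), 0.70),
    ((1_001, 5_000), 0.20),
    ((5_001, 100_000), 0.10),
]


def sample_sv_lengths(n_svs: int, seed: int = 0) -> list[int]:
    """Draw n_svs lengths: pick a bin by its weight, then a uniform length in it.

    Uniform sampling within a bin is an assumption made for illustration.
    """
    rng = random.Random(seed)
    ranges = [bounds for bounds, _ in LENGTH_BINS]
    weights = [weight for _, weight in LENGTH_BINS]
    chosen = rng.choices(ranges, weights=weights, k=n_svs)
    return [rng.randint(low, high) for low, high in chosen]


lengths = sample_sv_lengths(500)
short_fraction = sum(length <= 1_000 for length in lengths) / len(lengths)
print(f"{len(lengths)} SVs, fraction <= 1,000 bp: {short_fraction:.2f}")
```

A fixed seed keeps the draw reproducible, which is convenient when the same simulated genome has to be regenerated for several callers.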

Testing the Pipelines by Detecting SVs in the Simulated Data Sets
Using the simulated data sets, we applied different analytical pipelines to identify SVs. We first performed quality controls on the simulated data sets. For the Illumina data sets, the process for obtaining the BAM files is the same as the above BPS and Indel Mutation Analysis section. For the Nanopore data sets, they were firstly filtered by NanoFilt (v-2.8.0) (De Coster et al. 2018) to keep the reads with quality score Q ≥ 7 and then corrected by canu (v-1.7.1) (Koren et al. 2017). Then, the corrected reads were mapped to the reference genome NC_000913.3 using NGMLR(v-0.2.7) (Sedlazeck et al. 2018). Next, the SAM format files were converted into BAM files and sorted using SAMtools.
The pipelines with breseq (v-0.35.1) (Deatherage and Barrick 2014) and Manta (v-1.6.0) (Chen et al. 2016) were used to identify SVs using the preprocessed short-read data sets. breseq (Deatherage and Barrick 2014) is a versatile tool that mainly detects IS-mediated insertions and deletions of haploid microbial genomes. Given that breseq could not detect non-IS-mediated insertions and the random simulation introduces IS-mediated insertions at low frequency, we only used breseq for insertion and deletion calling. breseq was used to map the clean short reads to the reference genome by BOWTIE2, then implement the split-read alignment methods, reconstruct the candidate junction sequences into a new reference, and map again to predict and annotate mutations after correcting and analyzing with the default parameters. As breseq is mainly used for detecting deletions and insertions mediated by mobile elements, we also used Manta to compensate for its limitations in detecting other types of SVs (insertions, tandem duplications, and inversions). Manta performs excellently in detecting SVs in human genomes based on short reads (Cameron et al. 2019).
The other pipeline was based on Sniffles (v-1.0.12) (Sedlazeck et al. 2018). We required the number of supporting reads ≥ 10 and the SV length ≥ 50 bp, with default values for other parameters. In addition, we also used NanoVar (v-1.3.8) (Tham et al. 2020) and NanoSV (v-1.2.4) (Cretu Stancu et al. 2017) to detect the SVs in the simulated data sets and then compared these results with Sniffles to choose the best-performance pipeline.
To evaluate the detection efficiency of each pipeline, we introduced 3 criteria: sensitivity, precision, and F1 score. The calculations of these values follow the confusion matrix rule. After calculating the true positives (TP), false negatives (FN), and false positives (FP), we used the formulas as follows:

$$\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{precision} = \frac{TP}{TP + FP}, \qquad F_1\ \text{score} = \frac{2 \times \text{sensitivity} \times \text{precision}}{\text{sensitivity} + \text{precision}}$$

The true positives needed to meet 3 conditions: 1) the type of SVs must be the same as the simulated one, 2) the start position for called SV is the same as or within ±30 bp of the corresponding simulated SV, and 3) the SV length differs from the simulated one by no more than 30%. The error distribution associated with these cutoff lines is also shown in supplementary , the same Sniffles parameters as those used on the simulated data sets were performed. We subsequently eliminated the SVs that existed in the ancestors from the candidate SV calls. Then, SVs detected in 3 or more MA lines in each set (either long- or short-read) were also removed.
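The three true-positive conditions and the sensitivity, precision, and F1 score defined above can be sketched in Python as follows. This is a hedged illustration rather than the evaluation script used in the study: the `SV` record, the function names, and the greedy one-to-one pairing of calls with simulated SVs are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    sv_type: str   # e.g. "DEL", "INS", "DUP", "INV" (labels are illustrative)
    start: int     # start coordinate on the reference
    length: int    # SV length in bp


def is_true_positive(called: SV, simulated: SV,
                     max_shift: int = 30, max_len_diff: float = 0.30) -> bool:
    """Same type, start within ±30 bp, length within 30% of the simulated SV."""
    return (called.sv_type == simulated.sv_type
            and abs(called.start - simulated.start) <= max_shift
            and abs(called.length - simulated.length) <= max_len_diff * simulated.length)


def evaluate(calls: list[SV], simulated: list[SV]) -> tuple[float, float, float]:
    """Return (sensitivity, precision, F1); each simulated SV is matched at most once."""
    unmatched = list(simulated)
    tp = 0
    for call in calls:
        hit = next((s for s in unmatched if is_true_positive(call, s)), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp = len(calls) - tp
    fn = len(unmatched)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = sensitivity + precision
    f1 = 2 * sensitivity * precision / total if total else 0.0
    return sensitivity, precision, f1


# Toy example: one call matches, one is spurious, one simulated SV is missed.
truth = [SV("DEL", 1000, 500), SV("INS", 5000, 200)]
calls = [SV("DEL", 1010, 520), SV("INV", 9000, 300)]
print(evaluate(calls, truth))  # (0.5, 0.5, 0.5)
```

Matching each simulated SV at most once keeps duplicate calls of the same event from inflating the true-positive count.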

PCR Validation of Candidate SVs
Before Sanger sequencing, we filtered out hundreds of false positives in MA lines that were also present in the ancestral line (SVs are called by Sniffles wherever there are sequence differences between the ancestral genome and the reference genome) and those labeled as "imprecise" by Sniffles (supplementary table S17, Supplementary Material online, FP in MA-WT and MA-ΔmutS). The SV calls that passed the above filtering were then validated by PCR, using Primer5.0 to design primers for each specific target region and BlastN (v-2.13.0) (Zhang et al. 2000) to confirm that primers were unique, with low similarity to other nontarget genomic regions. All the primer sequences are shown in supplementary

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).